171 research outputs found
Learning Grimaces by Watching TV
Unlike computer vision systems, which require explicit supervision,
humans can learn facial expressions by observing people in their environment.
In this paper, we look at how similar capabilities could be developed in
machine vision. As a starting point, we consider the problem of relating facial
expressions to objectively measurable events occurring in videos. In
particular, we consider a gameshow in which contestants play to win significant
sums of money. We extract events affecting the game and corresponding facial
expressions objectively and automatically from the videos, obtaining large
quantities of labelled data for our study. We also develop state-of-the-art
deep neural networks for facial expression recognition, evaluated on benchmarks
such as FER and SFEW 2.0, showing that pre-training on face verification data
can be highly beneficial for this task. Then, we extend these models to use facial
expressions to predict events in videos and learn nameable expressions from
them. The dataset and emotion recognition models are available at
http://www.robots.ox.ac.uk/~vgg/data/facevalue
Comment: British Machine Vision Conference (BMVC) 201
Large scale evaluation of local image feature detectors on homography datasets
We present a large scale benchmark for the evaluation of local feature
detectors. Our key innovation is the introduction of a new evaluation protocol
which extends and improves the standard detection repeatability measure. The
new protocol is better suited to assessment on a large number of images and reduces
the dependency of the results on unwanted distractors such as the number of
detected features and the feature magnification factor. Additionally, our
protocol provides a comprehensive assessment of the expected performance of
detectors under several practical scenarios. Using images from the
recently-introduced HPatches dataset, we evaluate a range of state-of-the-art
local feature detectors on two main tasks: viewpoint and illumination invariant
detection. Contrary to previous detector evaluations, our study contains an
order of magnitude more image sequences, resulting in a quantitative evaluation
significantly more robust to over-fitting. We also show that traditional
detectors are still very competitive when compared to recent deep-learning
alternatives.
Comment: Accepted to BMVC 201
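The repeatability measure at the heart of such detector benchmarks can be sketched in a few lines. This is not the paper's extended protocol, only the standard notion it builds on: keypoints detected in one image are warped by the ground-truth homography and matched to keypoints detected in the other. The function and the pixel threshold here are illustrative assumptions.

```python
import numpy as np

def repeatability(kps_a, kps_b, H, thresh=3.0):
    """Fraction of keypoints from image A that, after warping by the
    homography H, land within `thresh` pixels of some keypoint in B.
    kps_a, kps_b: (N, 2) and (M, 2) arrays of (x, y) coordinates."""
    # Warp A's keypoints into B's frame using homogeneous coordinates.
    pts = np.hstack([kps_a, np.ones((len(kps_a), 1))])
    warped = pts @ H.T
    warped = warped[:, :2] / warped[:, 2:3]
    # Distance from each warped point to its nearest keypoint in B.
    d = np.linalg.norm(warped[:, None, :] - kps_b[None, :, :], axis=2)
    matched = d.min(axis=1) <= thresh
    return matched.mean()
```

With the identity homography and identical keypoint sets this returns 1.0; shifting one set far away drives it to 0.0. The extended protocol's point is that this raw score is sensitive to the number of detections and the measurement region, which the sketch does not control for.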
R-CNN minus R
Deep convolutional neural networks (CNNs) have had a major impact in most
areas of image understanding, including object category detection. In object
detection, methods such as R-CNN have obtained excellent results by integrating
CNNs with region proposal generation algorithms such as selective search. In
this paper, we investigate the role of proposal generation in CNN-based
detectors in order to determine whether it is a necessary modelling component,
carrying essential geometric information not contained in the CNN, or whether
it is merely a way of accelerating detection. We do so by designing and
evaluating a detector that uses a trivial region generation scheme, constant
for each image. Combined with SPP, this results in an excellent and fast
detector that does not require processing the image with any algorithm other
than the CNN itself. We also streamline and simplify the training of CNN-based
detectors by integrating several learning steps in a single algorithm, as well
as by proposing a number of improvements that accelerate detection.
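An image-independent proposal scheme of the kind investigated ("trivial, constant for each image") can be sketched as a fixed multi-scale grid of boxes. The scales and stride below are illustrative assumptions, not the paper's configuration.

```python
import itertools

def fixed_proposals(img_w, img_h, scales=(64, 128, 256), stride_frac=0.5):
    """Image-independent region proposals: a regular multi-scale grid of
    square boxes (x1, y1, x2, y2), identical for every image of the same
    size. No image content is consulted, unlike selective search."""
    boxes = []
    for s in scales:
        stride = max(1, int(s * stride_frac))
        for x, y in itertools.product(range(0, img_w - s + 1, stride),
                                      range(0, img_h - s + 1, stride)):
            boxes.append((x, y, x + s, y + s))
    return boxes
```

Because the boxes depend only on the image dimensions, they can be precomputed once; the question the paper asks is whether a CNN plus such a trivial scheme loses anything against learned or engineered proposals.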
Inductive Visual Localisation: Factorised Training for Superior Generalisation
End-to-end trained Recurrent Neural Networks (RNNs) have been successfully
applied to numerous problems that require processing sequences, such as image
captioning, machine translation, and text recognition. However, RNNs often
struggle to generalise to sequences longer than the ones encountered during
training. In this work, we propose to optimise neural networks explicitly for
induction. The idea is to first decompose the problem into a sequence of
inductive steps and then to explicitly train the RNN to reproduce such steps.
Generalisation is achieved as the RNN is not allowed to learn an arbitrary
internal state; instead, it is tasked with mimicking the evolution of a valid
state. In particular, the state is restricted to a spatial memory map that
tracks parts of the input image which have been accounted for in previous
steps. The RNN is trained for single inductive steps, where it produces updates
to the memory in addition to the desired output. We evaluate our method on two
different visual recognition problems involving visual sequences: (1) text
spotting, i.e. joint localisation and reading of text in images containing
multiple lines (or a block) of text, and (2) sequential counting of objects in
aerial images. We show that inductive training of recurrent models enhances
their generalisation ability on challenging image datasets.
Comment: In BMVC 2018 (spotlight)
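As a toy illustration of the factorised supervision described above, the sketch below replaces the RNN with a hand-written rule for sequential counting, purely to show the state protocol: each inductive step receives the spatial memory of what has already been accounted for and emits one output plus one memory update. All names are hypothetical and nothing here is learned.

```python
def inductive_counting_steps(grid):
    """Toy stand-in for inductive training on sequential counting.
    `grid` is a 2D list of 0/1 cells; the memory map marks cells already
    counted. Each step = (emitted cell, snapshot of updated memory),
    mirroring how the RNN would be supervised one step at a time rather
    than end-to-end over the whole sequence."""
    memory = [[False] * len(row) for row in grid]
    steps = []
    for i, row in enumerate(grid):
        for j, cell in enumerate(row):
            if cell and not memory[i][j]:
                memory[i][j] = True                      # the memory update
                steps.append(((i, j), [r[:] for r in memory]))
    return steps
```

The point of restricting the state to such a map is that a valid memory exists for grids of any size, so a model trained on single steps has a chance of generalising to longer sequences than it saw in training.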
Unsupervised learning of object landmarks by factorized spatial embeddings
Automatically learning the structure of object categories remains an
important open problem in computer vision. In this paper, we propose a novel
unsupervised approach that can discover and learn landmarks in object
categories, thus characterizing their structure. Our approach is based on
factorizing image deformations, as induced by a viewpoint change or an object
deformation, by learning a deep neural network that detects landmarks
consistently with such visual effects. Furthermore, we show that the learned
landmarks establish meaningful correspondences between different object
instances in a category without having to impose this requirement explicitly.
We assess the method qualitatively on a variety of object types, natural and
man-made. We also show that our unsupervised landmarks are highly predictive of
manually-annotated landmarks in face benchmark datasets, and can be used to
regress these with a high degree of accuracy.
Comment: To be published in ICCV 201
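The consistency requirement described above, that detected landmarks should transform along with the image deformation, can be written as a simple equivariance penalty. This is a sketch of the generic constraint only, not the paper's exact objective: `warp` stands in for the sampled deformation applied to 2D points, and the detector outputs are assumed to be (N, 2) landmark coordinates.

```python
import numpy as np

def equivariance_loss(landmarks_orig, landmarks_warped, warp):
    """Penalise disagreement between (a) landmarks detected on the
    deformed image and (b) the deformation applied to landmarks detected
    on the original image. Zero loss means the detector is perfectly
    equivariant under `warp`, which is the property the factorisation
    of image deformations enforces."""
    expected = np.array([warp(p) for p in landmarks_orig])
    return float(np.mean(np.sum((landmarks_warped - expected) ** 2, axis=1)))
```

Under the identity warp with matching detections the loss is zero, and it stays zero whenever the detections move exactly as the deformation predicts; no manual landmark annotation enters the objective, which is what makes the approach unsupervised.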